{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "17 - Train without labels",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Train without labels\n",
        "\n",
        "_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._\n",
        "\n",
        "Almost all data available is unlabeled. Labeled data takes effort to manually review and/or takes time to collect. Zero-shot classification takes existing large language models and runs a similarity comparison between candidate text and a list of labels. This has been shown to perform surprisingly well.\n",
        "\n",
        "The problem with zero-shot classifiers is that they need to have a large number of parameters (400M+) to perform well against general tasks, which comes with sizable hardware requirements.\n",
        "\n",
        "This notebook explores using zero-shot classifiers to build training data for smaller models. A simple form of [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation). "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai datasets pandas"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3PUe1OW8IZR5"
      },
      "source": [
        "# Apply zero-shot classifier to unlabeled text\n",
        "\n",
        "The following section takes a small 1000 record random sample of the sst2 dataset and applies a zero-shot classifer to the text. The labels are ignored. This dataset was chosen only to be able to evaluate the accuracy at then end. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GlrOnS4cmkih"
      },
      "source": [
        "import random\n",
        "\n",
        "from datasets import load_dataset\n",
        "\n",
        "from txtai.pipeline import Labels\n",
        "\n",
        "def batch(texts, size):\n",
        "    return [texts[x : x + size] for x in range(0, len(texts), size)]\n",
        "\n",
        "# Set random seed for repeatable sampling\n",
        "random.seed(42)\n",
        "\n",
        "ds = load_dataset(\"glue\", \"sst2\")\n",
        "\n",
        "sentences = random.sample(ds[\"train\"][\"sentence\"], 1000)\n",
        "\n",
        "# Load a zero shot classifier - txtai provides this through the Labels pipeline\n",
        "labels = Labels(\"microsoft/deberta-large-mnli\")\n",
        "\n",
        "train = []\n",
        "\n",
        "# Zero-shot prediction using [\"negative\", \"positive\"] labels\n",
        "for chunk in batch(sentences, 32):\n",
        "    train.extend([{\"text\": chunk[x], \"label\": label[0][0]} for x, label in enumerate(labels(chunk, [\"negative\", \"positive\"]))])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TLsZmRpHJGav"
      },
      "source": [
        "Next, we'll use the training set we just built to train a smaller Electra model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 214
        },
        "id": "nAt42TIHnfTN",
        "outputId": "7080b21d-ecf4-459a-c818-11c748e28bb7"
      },
      "source": [
        "from txtai.pipeline import HFTrainer\n",
        "\n",
        "trainer = HFTrainer()\n",
        "model, tokenizer = trainer(\"google/electra-base-discriminator\", train, num_train_epochs=5)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']\n",
            "- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
            "- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
            "Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <div>\n",
              "      \n",
              "      <progress value='625' max='625' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
              "      [625/625 02:51, Epoch 5/5]\n",
              "    </div>\n",
              "    <table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: left;\">\n",
              "      <th>Step</th>\n",
              "      <th>Training Loss</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>500</td>\n",
              "      <td>0.282800</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table><p>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J9pugqJSJRn6"
      },
      "source": [
        "# Evaluating accuracy\n",
        "\n",
        "Recall the training set is only 1000 records. To be clear, training an Electra model against the full sst2 dataset would perform better than below. But for this exercise, we're are not using the training labels and simulating labeled data not being available.\n",
        "\n",
        "First, lets see what the baseline accuracy for the zero-shot model would be against the sst2 evaluation set. Reminder that this has not seen any of the sst2 training data. \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RbgIrkgMvJS4",
        "outputId": "69287790-e01c-4c17-dfd5-0dc6afd73c98"
      },
      "source": [
        "labels = Labels(\"microsoft/deberta-large-mnli\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']\n",
            "- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
            "- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-36UBMILpKYh",
        "outputId": "3a340b9f-57c5-4c4c-d975-0fcc47df4930"
      },
      "source": [
        "results = [row[\"label\"] == labels(row[\"sentence\"], [\"negative\", \"positive\"])[0][0] for row in ds[\"validation\"]]\n",
        "sum(results) / len(ds[\"validation\"])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.8818807339449541"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uJVnWHZZKFIN"
      },
      "source": [
        "88.19% accuracy, not bad for a model that has not been trained on the dataset at all! Shows the power of zero-shot classification.\n",
        "\n",
        "Next, let's test our model trained on the 1000 zero-shot labeled records."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Kr5IZqZtvXlP",
        "outputId": "1faeb0d6-349b-4982-e9e8-cdbbde9e9a09"
      },
      "source": [
        "labels = Labels((model, tokenizer), dynamic=False)\n",
        "\n",
        "results = [row[\"label\"] == labels(row[\"sentence\"])[0][0] for row in ds[\"validation\"]]\n",
        "sum(results) / len(ds[\"validation\"])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.8738532110091743"
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sDw-Zh43KVdX"
      },
      "source": [
        "87.39% accuracy! Wouldn't get too carried away with the percentages but this at least nearly meets the accuracy of the zero-shot classifier.\n",
        "\n",
        "Now this model will be highly tuned for a specific task but it had the opportunity to learn from the combined 1000 records whereas the zero-shot classifier views each record independently. It's also much more performant. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QEAwki2lLM2A"
      },
      "source": [
        "# Conclusion\n",
        "\n",
        "This notebook explored a method of building trained text classifiers without training data being available. Given the amount of resources needed to run large-scale zero-shot classifiers, this method is a simple way to build smaller models tuned for specific tasks. In this example, the zero-shot classifier has 400M parameters and the trained text classifier has 110M. "
      ]
    }
  ]
}
